
BUG: read_json does not respect chunksize #38293


Merged: 6 commits into pandas-dev:master on Dec 23, 2020

Conversation

robertwb
Contributor

@robertwb robertwb commented Dec 4, 2020

@mroeschke
Member

Thanks. Mind adding a unit test and whatsnew entry for version 1.2?
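A unit test for this behavior might look something like the following sketch (hypothetical data and assertions, not the exact test added in this PR), checking that chunks come back at the requested size and that reassembling them matches an unchunked read:

```python
from io import StringIO

import pandas as pd
from pandas.testing import assert_frame_equal

# JSON Lines input: one record per line
jsonl = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}\n{"a": 5, "b": 6}\n{"a": 7, "b": 8}'

# With chunksize set, read_json returns an iterator of DataFrames
chunks = list(pd.read_json(StringIO(jsonl), lines=True, chunksize=2))

# Each chunk should have at most `chunksize` rows; 4 lines -> 2 chunks of 2
assert len(chunks) == 2
assert all(len(chunk) == 2 for chunk in chunks)

# Concatenating the chunks should reproduce the unchunked read
expected = pd.read_json(StringIO(jsonl), lines=True)
assert_frame_equal(pd.concat(chunks), expected)
```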

@mroeschke mroeschke added the IO JSON read_json, to_json, json_normalize label Dec 6, 2020
@jreback jreback added the Bug label Dec 8, 2020
@jreback jreback added this to the 1.2 milestone Dec 8, 2020
Contributor

@jreback jreback left a comment


yeah, ideally this change should have broken a test; obviously we don't have one. Please add a test which breaks on master and which this change fixes (look in the original issues for what we want to replicate).

@jreback jreback removed this from the 1.2 milestone Dec 13, 2020
@robertwb
Contributor Author

I've added a test and addressed the comments.

@jreback jreback added this to the 1.2 milestone Dec 22, 2020
@jreback
Contributor

jreback commented Dec 22, 2020

can you also post the benchmarks for line-based read_json? (should show that this improves vs master)

@robertwb
Contributor Author

I don't expect it to improve raw performance; it's about memory usage for large files.
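The memory point above can be illustrated with a small sketch (hypothetical in-memory data standing in for a large file): with `chunksize` set, `read_json` yields DataFrames one chunk at a time, so only one chunk needs to be resident rather than the whole parsed frame.

```python
from io import StringIO

import pandas as pd

# Stand-in for a large JSON Lines file; real use would pass a file path.
jsonl = "\n".join('{"x": %d}' % i for i in range(1000))

total = 0
# read_json with chunksize returns an iterator of DataFrames, so each
# chunk can be processed and discarded before the next one is parsed.
for chunk in pd.read_json(StringIO(jsonl), lines=True, chunksize=100):
    total += int(chunk["x"].sum())

assert total == sum(range(1000))  # 499500
```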

@jreback
Contributor

jreback commented Dec 22, 2020

> I don't expect it to improve raw performance; it's about memory usage for large files.

we have memory benchmarks to measure this

@jreback
Contributor

jreback commented Dec 23, 2020

@robertwb if you can post the memory benchmarks

@simonjayhawkins
Member

The output is BENCHMARKS NOT SIGNIFICANTLY CHANGED, but IIUC io.json.ReadJSONLines.peakmem_read_json_lines_concat is reduced from 206M to 188M.

maybe we need to change the benchmark to see a bigger effect.

(pandas-dev) simon@WIN-7LJRFMBC2ME:~/pandas/asv_bench$ asv continuous -f 1.1 master HEAD -b peakmem_read_json
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Building 4b856ee8 <json-chunksize> for conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.........................................
·· Installing 4b856ee8 <json-chunksize> into conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
· Running 6 total benchmarks (2 commits * 1 environments * 3 benchmarks)
[  0.00%] · For pandas commit 4b856ee8 <json-chunksize> (round 1/1):
[  0.00%] ·· Benchmarking conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 16.67%] ··· io.json.ReadJSONLines.peakmem_read_json_lines                                                           ok
[ 16.67%] ··· ========== ======
                index
              ---------- ------
                 int      250M
               datetime   250M
              ========== ======

[ 33.33%] ··· io.json.ReadJSONLines.peakmem_read_json_lines_concat                                                    ok
[ 33.33%] ··· ========== ======
                index
              ---------- ------
                 int      188M
               datetime   188M
              ========== ======

[ 50.00%] ··· io.json.ReadJSONLines.peakmem_read_json_lines_nrows                                                     ok
[ 50.00%] ··· ========== ======
                index
              ---------- ------
                 int      188M
               datetime   188M
              ========== ======

[ 50.00%] · For pandas commit 2733a109 <master> (round 1/1):
[ 50.00%] ·· Building for conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt......................................
[ 50.00%] ·· Benchmarking conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 66.67%] ··· io.json.ReadJSONLines.peakmem_read_json_lines                                                           ok
[ 66.67%] ··· ========== ======
                index
              ---------- ------
                 int      248M
               datetime   248M
              ========== ======

[ 83.33%] ··· io.json.ReadJSONLines.peakmem_read_json_lines_concat                                                    ok
[ 83.33%] ··· ========== ======
                index
              ---------- ------
                 int      206M
               datetime   206M
              ========== ======

[100.00%] ··· io.json.ReadJSONLines.peakmem_read_json_lines_nrows                                                     ok
[100.00%] ··· ========== ======
                index
              ---------- ------
                 int      195M
               datetime   195M
              ========== ======


BENCHMARKS NOT SIGNIFICANTLY CHANGED.

@jreback
Contributor

jreback commented Dec 23, 2020

> the output is BENCHMARKS NOT SIGNIFICANTLY CHANGED. but IIUC io.json.ReadJSONLines.peakmem_read_json_lines_concat is reduced from 206M to 188M
>
> maybe we need to change the benchmark to see a bigger effect.

so how do we know that this is working (and the previous was not)? Sure, this is a difference, but wouldn't you expect a HUGE difference here? (e.g. maybe the benchmark should be reading with chunksize=1 or something vs no chunksize)

@simonjayhawkins
Member

> so how do we know that this is working (and the previous was not)? Sure, this is a difference, but wouldn't you expect a HUGE difference here? (e.g. maybe the benchmark should be reading with chunksize=1 or something vs no chunksize)

We do a concat of the chunks in the benchmark; maybe we could just iterate through the chunks without the concat.

I'm currently running with chunksize=1 to compare.
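The effect being discussed can be eyeballed outside asv with a rough sketch using `tracemalloc` (not the asv harness itself, and the numbers will not match asv's peakmem figures): iterating the chunks without a concat keeps only one chunk alive at a time, while the unchunked read materializes the whole frame.

```python
import tracemalloc
from io import StringIO

import pandas as pd

jsonl = "\n".join('{"v": %d}' % i for i in range(50000))

def peak_bytes(fn):
    """Return the peak traced allocation while running fn."""
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

# Unchunked: the whole frame is materialized at once
full = peak_bytes(lambda: pd.read_json(StringIO(jsonl), lines=True))

# Chunked, no concat: only one chunk alive at a time
def chunked():
    for _ in pd.read_json(StringIO(jsonl), lines=True, chunksize=1000):
        pass

partial = peak_bytes(chunked)
print(f"full: {full} bytes, chunked (no concat): {partial} bytes")
```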

@jreback
Contributor

jreback commented Dec 23, 2020

> > so how do we know that this is working (and the previous was not)? Sure, this is a difference, but wouldn't you expect a HUGE difference here? (e.g. maybe the benchmark should be reading with chunksize=1 or something vs no chunksize)
>
> we do a concat of the chunks in the benchmark. maybe we could just iterate through the chunks without the concat.
>
> am currently running with chunksize=1 to compare.

oh yeah the concat will increase memory :-> don't compare that

@simonjayhawkins
Member

changing the chunksize to 10 (1 fails) makes a bit more difference (to the memory used on master!)

       before           after         ratio
     [2733a109]       [d731dc87]
     <master>         <json-chunksize>
-            224M             189M     0.84  io.json.ReadJSONLines.peakmem_read_json_lines_concat('int')
-            224M             188M     0.84  io.json.ReadJSONLines.peakmem_read_json_lines_concat('datetime')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@simonjayhawkins
Member

> oh yeah the concat will increase memory :-> don't compare that

Presumably we do the concat to compare the results with peakmem_read_json_lines:

    def peakmem_read_json_lines(self, index):
        read_json(self.fname, orient="records", lines=True)
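The role of the concat in the benchmark can be shown with a minimal standalone sketch (hypothetical data, not the asv code): concatenating the chunks rebuilds the full frame, which is both why the result can be compared against the unchunked read and why the concat variant shows a higher peak than pure iteration would.

```python
from io import StringIO

import pandas as pd

jsonl = '{"a": 1}\n{"a": 2}\n{"a": 3}\n{"a": 4}'

# Concatenating the chunks materializes the full frame again, raising
# peak memory, but makes the result comparable to the unchunked read.
result = pd.concat(pd.read_json(StringIO(jsonl), lines=True, chunksize=2))
expected = pd.read_json(StringIO(jsonl), lines=True)

pd.testing.assert_frame_equal(result, expected)
```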

@jreback
Contributor

jreback commented Dec 23, 2020

ok, fair. @simonjayhawkins I wouldn't object to changing the benchmarks (but this PR is fine).

@jreback jreback merged commit 936d125 into pandas-dev:master Dec 23, 2020
@jreback
Contributor

jreback commented Dec 23, 2020

thanks @robertwb

@jreback
Contributor

jreback commented Dec 23, 2020

@meeseeksdev backport 1.2.x

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Dec 23, 2020